December 21, 2018
NYC’s Citi Bike is the largest bike share system in the country with over 800 stations. An inherent challenge in operating a bikeshare system is the uneven distribution of demand and the resultant difficulty in ensuring the availability of bikes or docks for users where and when they need them. To cope with this, bikeshare operators manually redistribute or rebalance bikes based on spatial and temporal demand trends.
This map shows the extent of Citi Bike stations, which spans three boroughs and also includes Jersey City. For the purposes of this study, we focus on 736 docks in New York only.
ggplot(all_stations) +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(aes(x = lon, y = lat, color = Region), size = 1) +
coord_sf(xlim = c(min(all_stations$lon),max(all_stations$lon)),
ylim = c(min(all_stations$lat),max(all_stations$lat))) +
labs(title = "Citi Bike stations") +
mapTheme() +
guides(colour = guide_legend(override.aes = list(size=3)))
for (i in 0:23) {
gridExtra::grid.arrange(
ggplot() +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(data = plot_time %>%
filter(hour == i, wknd == 0) %>%
arrange(dest - orig),
aes(X, Y,
color = dest - orig,
size = abs(dest - orig))) +
coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
mapTheme() +
scale_size(range = c(0.5, 3), guide = "none") +
scale_color_gradientn(colors = c("limegreen",
"limegreen","gray20",
"magenta","magenta"),
limits = c(-60,60)) +
labs(title = paste0("Existing need for rebalancing \n(hour: ",i,")"),
color = "Excess bikes")
)
}
The need for rebalancing comes from the spatial and temporal asymmetry in supply and demand for bikes. In the am peak, there tends to be an excess of bikes in job centers like the Financial District or Midtown, meaning that people are traveling to these areas from their residences in other parts of the city. The trend flips in the PM peak as people return home to areas like the Upper West Side and Lower East Side.
Rebalancing can be costly, requiring special vehicles, labor, and time, and can be made even more costly by inefficient routing especially during peak periods with high traffic. With such a huge system in such a large city, rebalancing becomes a formidable task, making machine learning algorithms that can predict demand and supply and allow more efficient rebalancing activity all the more valuable. Using predictive modeling and machine learning can help facilitate the efficient redistribution of bikes where and when they are needed, meeting customer demand and helping boost ridership while ensuring efficient use of resources.
To facilitate the rebalancing of bikes in the Citi Bike system, we are proposing a new and improved Citi Bike app. The app will have functionality both for operators, who will be redistributing multiple bikes at a time in special vehicles, and for users who can redistribute bikes in exchange for points and perks.
The app, using predicted demand and supply for each station, will inform operators of the number of excess bikes or docks at any station in real time. Stations will be assigned priority values, based on the model-derived immediacy of their need for rebalancing. The app will also have a routing capability, creating efficient routes for operators to follow to pick up and drop off bikes, while taking into account each station’s rebalancing priority level.
The customer interface will provide similar information to the operator version of the app, but in a gamified manner. Rather than displaying the number of bikes needed to be rebalanced at each station, the app will show customers point values derived from the rebalancing priority level of each station. When members rebalance bikes from or to these stations, the corresponding point value will be added to their account, as part of Citi Bike’s Bike Angel program. These points will then be redeemable for various rewards, including extended memberships. This version of the app will also have a routing capability, but one that is based on an origin and destination inputted by the user. When displaying routing options, however, the app will provide an option for the most direct route, as well as an option for a route that prioritizes stations that need rebalancing.
To find out if any particular station needs rebalancing, we create two models predicting hourly demand for and supply of bikes to each station. The amount of bikes needed to be rebalanced at any given station at any given hour is the difference between the two values. If demand exceeds supply, bikes need to be added, and vice versa.
To elaborate, the demand of bikes at a Citi Bike station is equal to the number of trips that originated from that station during that hour. The supply of bikes is the number of trips whose destination is that station. These values can be aggregated from publicly available Citi Bike data (citibikenyc.com/system-data). The original data from May 2018 included 1.8 million trips.
gridExtra::grid.arrange(
qplot(data = plot_data, x = orig, geom = "histogram") +
labs(title = "Citi Bike trip volumes",
subtitle = "Week of May 13 to May 19, 2018",
x = "Hourly departures\n(demand for bikes)") +
plotTheme() +
theme(axis.text.y = element_blank()) +
coord_cartesian(xlim = c(0,50), ylim = c(4500,100000)),
qplot(data = plot_data, x = orig, geom = "histogram") +
labs(title = "", subtitle = "",
x = "Hourly arrivals\n(supply of bikes)") +
plotTheme() +
theme(axis.text.y = element_blank()) +
coord_cartesian(xlim = c(0,50), ylim = c(4500,100000)),
nrow = 1)
Demand for and supply of bikes varies both spatially and temporally. On weekdays, the number of Citi Bike trips spikes at the start and end of the work day. On weekends, the number of trips gradually rises, then falls. We also expect that weather conditions affect how much people want to ride, as historical weather data shows rain on Tuesday, Wednesday, and Thursday, the days with less trips in the plot above.
ggplot(plot_hour) +
geom_line(aes(x = hour, y = trip,
color = wday,
linetype = wday)) +
plotTheme() +
coord_cartesian(xlim = c(1,23),ylim = c(350,7650)) +
labs(title = "Systemwide Citi Bike trips",
subtitle = "Week of May 13 to May 19, 2018",
y = element_blank(), x = "Time (hour)",
color = "Day", linetype = "Day") +
scale_linetype_manual(values = c("dashed","solid","solid",
"solid","solid","solid",
"dashed")) +
scale_color_manual(values = c("black",
"red","orange","chartreuse","turquoise","blue",
"gray70"))
Certain areas like Midtown Manhattan experience a high supply of bikes during the AM peak, and a high demand for them in the PM peak. This reflects commuting behaviors as residents go to work in the morning and return at night.
demand_time <- function(TIME,WKND) {
ggplot() +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(data = plot_time %>%
filter(hour == TIME, wknd == WKND) %>%
arrange(orig),
aes(X, Y,
color = orig,
size = orig)) +
scale_size(range = c(0.5, 3)) +
coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
mapTheme() +
scale_color_gradientn(colors = c("gray20",
"limegreen","limegreen")) +
theme(legend.position="none") +
labs(subtitle = "Departures")
}
supply_time <- function(TIME,WKND) {
ggplot() +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(data = plot_time %>%
filter(hour == TIME, wknd == WKND) %>%
arrange(dest),
aes(X, Y,
color = dest,
size = dest)) +
scale_size(range = c(0.5, 3)) +
coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
mapTheme() +
scale_color_gradientn(colors = c("gray20",
"magenta","magenta")) +
theme(legend.position="none") +
labs(subtitle = "Arrivals")
}
for (i in 0:23) {
gridExtra::grid.arrange(
demand_time(i,0) + labs(title = paste0("Weekly Citi Bike activity (hour: ",i,")")),
supply_time(i,0) + labs(title = ""),
nrow = 1
)
}
| Departures | Arrivals | |
|---|---|---|
| Nearest 2 | 0.000024 | 0.0000051 |
| Nearest 5 | 0.006110 | 0.0059500 |
| Nearest 10 | 0.020000 | 0.0930000 |
Each station is likely to follow the same trends every week, as commuters are likely to take the same station. Furthermore, riders also tend to choose between a variety of stations, especially when they’re clustered together. A significant p-value from the Moran’s I test indicates that nearby stations are, indeed, autocorrelated. To account for this autocorrelation, we introduce some spatial and time lag variables as well.
| Category | Variable | Description |
|---|---|---|
| Dependent variables | orig | Trip origins (demand for bikes / departures from each station at each hour) |
| dest | Trip destinations (supply of bikes / arrivals from each station at each hour) | |
| Station characteristics | capacity | Number of docks |
| GEOID | Census tract fixed effects should capture spatial and geographic features including land use, sociodemographic characteristics, residence and employment distribution, etc. | |
| ZIPCODE | ZIP code fixed effects | |
| dist_nn2 | Average distance to nearest 2 neighbors | |
| dist_nn5 | Average distance to nearest 5 neighbors | |
| dist_nn10 | Average distance to nearest 10 neighbors | |
| Time | wday | Day of the week (Sun, Mon, etc) |
| wknd | Trip occurred on a weekend (1 = yes, 0 = no) | |
| hour | Hour of day (0-23) | |
| time | Time of day (night = 0-5, morning = 6-9, midday = 10-15, afternoon = 16-19, evening = 20-23) | |
| Weather | temp | Wet bulb temperature in degrees F (noaa.gov) |
| wind | Wind speed in miles per hour (noaa.gov) | |
| humid | Relative humidity in percent (noaa.gov) | |
| precip | Precipitation in inches (noaa.gov) | |
| Temporal lag | *_lh | Origins or destinations in the last hour |
| *_l3hr | Origins or destinations in the last three hours | |
| *_lw | Origins or destinations this hour, one week ago | |
| *_lw_3hr | Origins or destinations in the previous three hours, one week ago | |
| *_day | Origins or destinations for this day, one week ago | |
| *_time | Origins or destinations for this time, one week ago | |
| Spatial lag | *_nn2 | All temporal lag variables for the average of the nearest 2 neighbors |
| *_nn5 |
All temporal lag variables for the average of the nearest 2 neighbors |
|
| *_nn10 |
All temporal lag variables for the average of the nearest 2 neighbors |
|
| cap_nn2 | Average capacity of nearest 2 neighbors | |
| cap_nn5 | Average capacity of nearest 5 neighbors | |
| cap_nn10 | Average capacity of nearest 10 neighbors |
Because demand for and supply of bikes each hour are counts with many zero values, we build our models using a Poisson regression. Poisson regressions assume that the dependent variables (in this case, demand and supply) have a Poisson distribution and can be predicted with some linear combination of contributing factors.
Using a forward stepwise procedure, we incrementally add and select the most significant contributing factors that give accurate and generalizable models. For our demand model, there are 39 independent variables; for the supply model, there are 33. Interestingly, this is a large number of variables that proved signficant; it’s likely than more feature engineering will prove helpful.
Demand model (predicting departures)
summary(mod_dep10$finalModel)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -21.4958 -1.3208 -0.6108 0.3840 17.3369
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -0.8453639 0.0384561 -21.983
## dist_nn2 -130.9854350 6.1392000 -21.336
## dist_nn5 131.9500930 11.1575912 11.826
## dist_nn10 -47.3410421 7.2269898 -6.551
## time.L 0.9998751 0.0085045 117.570
## time.Q -1.4100442 0.0083496 -168.875
## time.C 0.4460193 0.0054882 81.268
## `time^4` -0.4969285 0.0051327 -96.816
## wday.L -0.1151926 0.0065260 -17.651
## wday.Q -0.1425098 0.0095538 -14.917
## wday.C -0.1218532 0.0057609 -21.152
## `wday^4` -0.3349198 0.0060525 -55.336
## `wday^5` -0.0361749 0.0061012 -5.929
## `wday^6` 0.0774497 0.0058876 13.155
## capacity 0.0105658 0.0002123 49.777
## dest_lh 0.0012698 0.0003861 3.289
## orig_lh 0.0052830 0.0003255 16.229
## orig_lw 0.0135933 0.0001545 87.974
## dest_time -0.0151999 0.0003450 -44.056
## orig_day 0.0478420 0.0024620 19.432
## dest_day 0.0399845 0.0024445 16.357
## first_nn2 -0.2406566 0.0074941 -32.113
## excess_net2 -0.0035671 0.0003789 -9.414
## orig_nn2 -0.0034958 0.0007696 -4.542
## dest_lw_nn2 0.0027711 0.0006247 4.436
## orig_day_nn2 0.0104140 0.0011826 8.806
## cap_nn5 -0.0251648 0.0006823 -36.882
## first_nn5 0.0174613 0.0075823 2.303
## excess_net5 0.0023500 0.0002620 8.968
## orig_nn5 0.0142640 0.0013850 10.299
## orig_lh_nn5 -0.0031417 0.0010839 -2.898
## dest_time_nn5 -0.0045999 0.0011388 -4.039
## cap_nn10 0.0433325 0.0008807 49.202
## first_nn10 0.0604756 0.0055394 10.917
## excess_net10 0.0050084 0.0001676 29.874
## orig_nn10 0.0844361 0.0015702 53.773
## orig_lh_nn10 0.0219792 0.0015074 14.581
## dest_lh_nn10 -0.0204355 0.0010888 -18.768
## dest_lw_nn10 -0.0214091 0.0010711 -19.988
## orig_time_nn10 -0.0188923 0.0012985 -14.549
## orig_day_nn10 0.0417717 0.0098312 4.249
## dest_day_nn10 -0.0488454 0.0094126 -5.189
## HOURLYWETBULBTEMPF 0.0057227 0.0005121 11.175
## HOURLYWindSpeed -0.0201322 0.0009108 -22.103
## HOURLYRelativeHumidity -0.0035339 0.0001894 -18.654
## precip 1.1088035 0.0936744 11.837
## dest_l3hr 0.0079671 0.0005163 15.431
## orig_l3hr 0.0011425 0.0005072 2.253
## Pr(>|z|)
## (Intercept) < 0.0000000000000002 ***
## dist_nn2 < 0.0000000000000002 ***
## dist_nn5 < 0.0000000000000002 ***
## dist_nn10 0.0000000000573 ***
## time.L < 0.0000000000000002 ***
## time.Q < 0.0000000000000002 ***
## time.C < 0.0000000000000002 ***
## `time^4` < 0.0000000000000002 ***
## wday.L < 0.0000000000000002 ***
## wday.Q < 0.0000000000000002 ***
## wday.C < 0.0000000000000002 ***
## `wday^4` < 0.0000000000000002 ***
## `wday^5` 0.0000000030458 ***
## `wday^6` < 0.0000000000000002 ***
## capacity < 0.0000000000000002 ***
## dest_lh 0.00101 **
## orig_lh < 0.0000000000000002 ***
## orig_lw < 0.0000000000000002 ***
## dest_time < 0.0000000000000002 ***
## orig_day < 0.0000000000000002 ***
## dest_day < 0.0000000000000002 ***
## first_nn2 < 0.0000000000000002 ***
## excess_net2 < 0.0000000000000002 ***
## orig_nn2 0.0000055614405 ***
## dest_lw_nn2 0.0000091581861 ***
## orig_day_nn2 < 0.0000000000000002 ***
## cap_nn5 < 0.0000000000000002 ***
## first_nn5 0.02128 *
## excess_net5 < 0.0000000000000002 ***
## orig_nn5 < 0.0000000000000002 ***
## orig_lh_nn5 0.00375 **
## dest_time_nn5 0.0000536610023 ***
## cap_nn10 < 0.0000000000000002 ***
## first_nn10 < 0.0000000000000002 ***
## excess_net10 < 0.0000000000000002 ***
## orig_nn10 < 0.0000000000000002 ***
## orig_lh_nn10 < 0.0000000000000002 ***
## dest_lh_nn10 < 0.0000000000000002 ***
## dest_lw_nn10 < 0.0000000000000002 ***
## orig_time_nn10 < 0.0000000000000002 ***
## orig_day_nn10 0.0000214847612 ***
## dest_day_nn10 0.0000002110048 ***
## HOURLYWETBULBTEMPF < 0.0000000000000002 ***
## HOURLYWindSpeed < 0.0000000000000002 ***
## HOURLYRelativeHumidity < 0.0000000000000002 ***
## precip < 0.0000000000000002 ***
## dest_l3hr < 0.0000000000000002 ***
## orig_l3hr 0.02429 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 687401 on 124383 degrees of freedom
## Residual deviance: 263138 on 124336 degrees of freedom
## AIC: 456626
##
## Number of Fisher Scoring iterations: 6
Supply model (predicting arrivals)
summary(mod_arr10$finalModel)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -25.8662 -1.3366 -0.6208 0.4024 11.0435
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -0.9427152 0.0386508 -24.391
## dist_nn2 -158.7799439 6.1280266 -25.910
## dist_nn5 177.4586871 11.1610849 15.900
## dist_nn10 -65.2706028 7.2474390 -9.006
## time.L 1.0456662 0.0082816 126.265
## time.Q -1.3079668 0.0081156 -161.168
## time.C 0.3854922 0.0054901 70.216
## `time^4` -0.4434657 0.0050527 -87.767
## wday.L -0.1164255 0.0064981 -17.917
## wday.Q -0.0572144 0.0095759 -5.975
## wday.C -0.1189027 0.0057780 -20.579
## `wday^4` -0.3256106 0.0060630 -53.705
## `wday^5` 0.0093756 0.0061112 1.534
## `wday^6` 0.0775754 0.0059298 13.082
## capacity 0.0099184 0.0002122 46.748
## dest_lh 0.0037569 0.0002421 15.518
## orig_lh 0.0031754 0.0003511 9.044
## dest_lw 0.0144099 0.0002437 59.140
## orig_lw -0.0019523 0.0002807 -6.956
## orig_time -0.0200494 0.0005018 -39.958
## dest_time 0.0141580 0.0004610 30.712
## orig_day 0.0643461 0.0023072 27.889
## dest_day 0.0182646 0.0023065 7.919
## first_nn2 -0.2482067 0.0074578 -33.281
## excess_net2 -0.0042153 0.0004157 -10.140
## orig_nn2 -0.0097636 0.0008516 -11.465
## orig_lw_nn2 0.0034259 0.0008103 4.228
## dest_lw_nn2 0.0064262 0.0007490 8.580
## orig_time_nn2 -0.0085416 0.0011562 -7.388
## orig_day_nn2 0.0192243 0.0016494 11.655
## cap_nn5 -0.0222072 0.0007074 -31.393
## first_nn5 0.0321000 0.0075764 4.237
## excess_net5 0.0070572 0.0002575 27.407
## orig_nn5 0.0232214 0.0016445 14.121
## orig_lh_nn5 -0.0044842 0.0011685 -3.838
## orig_lw_nn5 0.0045396 0.0016172 2.807
## dest_lw_nn5 -0.0131555 0.0013655 -9.634
## orig_time_nn5 0.0177190 0.0023314 7.600
## orig_day_nn5 -0.0247999 0.0033243 -7.460
## cap_nn10 0.0403879 0.0008993 44.911
## first_nn10 0.0682479 0.0055322 12.336
## orig_nn10 0.0587683 0.0014771 39.787
## orig_lh_nn10 0.0240463 0.0013199 18.219
## orig_lw_nn10 -0.0124289 0.0016537 -7.516
## dest_lw_nn10 0.0094134 0.0012289 7.660
## orig_time_nn10 -0.0514587 0.0023060 -22.315
## orig_day_nn10 0.0142676 0.0032913 4.335
## HOURLYWETBULBTEMPF 0.0096334 0.0005148 18.712
## HOURLYWindSpeed -0.0185605 0.0009051 -20.506
## HOURLYRelativeHumidity -0.0049450 0.0001874 -26.385
## precip 0.5608956 0.0949327 5.908
## orig_l3hr 0.0036426 0.0004503 8.090
## Pr(>|z|)
## (Intercept) < 0.0000000000000002 ***
## dist_nn2 < 0.0000000000000002 ***
## dist_nn5 < 0.0000000000000002 ***
## dist_nn10 < 0.0000000000000002 ***
## time.L < 0.0000000000000002 ***
## time.Q < 0.0000000000000002 ***
## time.C < 0.0000000000000002 ***
## `time^4` < 0.0000000000000002 ***
## wday.L < 0.0000000000000002 ***
## wday.Q 0.000000002303087518 ***
## wday.C < 0.0000000000000002 ***
## `wday^4` < 0.0000000000000002 ***
## `wday^5` 0.124988
## `wday^6` < 0.0000000000000002 ***
## capacity < 0.0000000000000002 ***
## dest_lh < 0.0000000000000002 ***
## orig_lh < 0.0000000000000002 ***
## dest_lw < 0.0000000000000002 ***
## orig_lw 0.000000000003503069 ***
## orig_time < 0.0000000000000002 ***
## dest_time < 0.0000000000000002 ***
## orig_day < 0.0000000000000002 ***
## dest_day 0.000000000000002399 ***
## first_nn2 < 0.0000000000000002 ***
## excess_net2 < 0.0000000000000002 ***
## orig_nn2 < 0.0000000000000002 ***
## orig_lw_nn2 0.000023598317609794 ***
## dest_lw_nn2 < 0.0000000000000002 ***
## orig_time_nn2 0.000000000000149121 ***
## orig_day_nn2 < 0.0000000000000002 ***
## cap_nn5 < 0.0000000000000002 ***
## first_nn5 0.000022670227377094 ***
## excess_net5 < 0.0000000000000002 ***
## orig_nn5 < 0.0000000000000002 ***
## orig_lh_nn5 0.000124 ***
## orig_lw_nn5 0.005001 **
## dest_lw_nn5 < 0.0000000000000002 ***
## orig_time_nn5 0.000000000000029558 ***
## orig_day_nn5 0.000000000000086468 ***
## cap_nn10 < 0.0000000000000002 ***
## first_nn10 < 0.0000000000000002 ***
## orig_nn10 < 0.0000000000000002 ***
## orig_lh_nn10 < 0.0000000000000002 ***
## orig_lw_nn10 0.000000000000056568 ***
## dest_lw_nn10 0.000000000000018582 ***
## orig_time_nn10 < 0.0000000000000002 ***
## orig_day_nn10 0.000014581349814439 ***
## HOURLYWETBULBTEMPF < 0.0000000000000002 ***
## HOURLYWindSpeed < 0.0000000000000002 ***
## HOURLYRelativeHumidity < 0.0000000000000002 ***
## precip 0.000000003455468658 ***
## orig_l3hr 0.000000000000000597 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 693982 on 124383 degrees of freedom
## Residual deviance: 259189 on 124332 degrees of freedom
## AIC: 452548
##
## Number of Fisher Scoring iterations: 6
We use 100-fold cross-validation, an out-of-sample validation method that randomly separates the data into 100 portions.
To further test whether our models can predict out-of-sample, we use our models to predict demand for and supply of bikes at all stations on Friday, May 25 and Saturday, May 26 the following week. By subtracting demand from supply, we can determine the excess number of Citi Bikes at any given station at all hours of the day.
plot_result <- output %>%
left_join(all_stations %>% select(station_id,lon,lat)) %>%
mutate(exc = dest - orig,
exc = ifelse(exc > 50, 50, exc),
orig = ifelse(orig > 50, 50, orig),
dest = ifelse(dest > 50, 50, dest))
demand_time <- function(DATE) {
ggplot() +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(data = plot_result %>%
filter(date_id == DATE) %>%
arrange(orig),
aes(lon, lat,
color = orig,
size = orig)) +
scale_size(range = c(0.5, 3)) +
coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
mapTheme() +
scale_color_gradientn(colors = c("gray20",
"limegreen","limegreen")) +
theme(legend.position="none") +
labs(subtitle = "Departures")
}
supply_time <- function(DATE) {
ggplot() +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(data = plot_result %>%
filter(date_id == DATE) %>%
arrange(dest),
aes(lon, lat,
color = dest,
size = dest)) +
scale_size(range = c(0.5, 3)) +
coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
mapTheme() +
scale_color_gradientn(colors = c("gray20",
"magenta","magenta")) +
theme(legend.position="none") +
labs(subtitle = "Arrivals")
}
excess_time <- function(DATE) {
ggplot() +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(data = plot_result %>%
filter(date_id == DATE) %>%
arrange(abs(exc)),
aes(lon, lat,
color = exc,
size = abs(exc))) +
scale_size(range = c(0.5, 3)) +
coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
mapTheme() +
scale_color_gradientn(colors = c("limegreen","limegreen","gray20",
"magenta","magenta")) +
theme(legend.position="none") +
labs(subtitle = "Excess")
}
for (i in unique(output$date_id)) {
date <- as.POSIXlt(i, origin="1970-01-01", tz="America/New_York") + dhours(4)
date <- paste0(wday(date, label = TRUE)," ",hour(date),":00")
gridExtra::grid.arrange(
demand_time(i),
supply_time(i),
excess_time(i),
nrow = 1,
top = grid::textGrob(paste0("Predicted activity (",date,")"),
gp=grid::gpar(fontsize=14),
hjust = 1)
)
}
Negative numbers mean a shortage of bikes.
output %>%
filter(date(date_id) == ymd("2018-05-25")) %>%
mutate(hour = paste0(hour(date_id),":00")) %>%
arrange(desc(abs(exc_pred))) %>%
head(10) %>%
arrange(hour(date_id)) %>%
left_join(all_stations %>% select(station_id,name)) %>%
select(Station = name,
Time = hour,
`Excess bikes` = exc_pred) %>%
kable(caption = "Friday, May 25", digits = 1) %>%
kable_styling(bootstrap_options = c("hover"))
| Station | Time | Excess bikes |
|---|---|---|
| W 33 St & 7 Ave | 7:00 | -164.2 |
| W 33 St & 7 Ave | 8:00 | -85.0 |
| W 52 St & 5 Ave | 8:00 | 68.0 |
| Broadway & E 22 St | 9:00 | 71.5 |
| W 33 St & 7 Ave | 15:00 | 67.9 |
| W 33 St & 7 Ave | 16:00 | 607.2 |
| Pershing Square North | 17:00 | -197.3 |
| W 33 St & 7 Ave | 17:00 | 195.7 |
| W 31 St & 7 Ave | 17:00 | 151.1 |
| Pershing Square North | 18:00 | -115.6 |
output %>%
filter(date(date_id) == ymd("2018-05-26")) %>%
mutate(hour = paste0(hour(date_id),":00")) %>%
arrange(desc(abs(exc_pred))) %>%
head(10) %>%
arrange(hour(date_id)) %>%
left_join(all_stations %>% select(station_id,name)) %>%
select(Station = name,
Time = hour,
`Excess bikes` = exc_pred) %>%
kable(caption = "Saturday, May 26", digits = 1) %>%
kable_styling(bootstrap_options = c("hover"))
| Station | Time | Excess bikes |
|---|---|---|
| E 15 St & 3 Ave | 12:00 | -3.1 |
| Grand Army Plaza & Central Park S | 16:00 | -4.4 |
| St Marks Pl & 1 Ave | 16:00 | 3.8 |
| Broadway & W 49 St | 16:00 | 2.9 |
| E 11 St & 1 Ave | 16:00 | 2.5 |
| E 15 St & 3 Ave | 17:00 | 3.2 |
| 6 Ave & Canal St | 17:00 | 2.5 |
| St Marks Pl & 1 Ave | 18:00 | 4.1 |
| Grand Army Plaza & Central Park S | 18:00 | -3.5 |
| St Marks Pl & 1 Ave | 19:00 | 3.0 |
In the predictions for Friday above, we can immediately tell that the models need to be validated, because some stations report an exorbitant number upwards of 600 bikes that need to be rebalanced. Two areas of model validation are accuracy, or the magnitude of error, and generalizability, or the distribution of error across variables.
General model accuracy can be measured using Mean Absolute Error (MAE), which a type of average that can measure the magnitude of error even when some error values are negative. A second type of error, Mean Bias Error (MBE) takes the average without adjusting for negative values. This type of error can reveal whether, on average, predicted values are higher (MBE > 0) or lower (MBE < 0) than observed values.
The table below indicates the error for our out-sample prediction. An MAE of about 2 means that on average, the models’ predictions are off by 2 bikes. On weekends, both models have a negative MBE, indicating that they are likely to under-predict the number of arrivals and departures.
| Day | Demand MAE | Demand MBE | Supply MAE | Supply MBE |
|---|---|---|---|---|
| Fri | 2.33 | 0.04 | 2.38 | 0.20 |
| Sat | 2.12 | -1.17 | 2.10 | -0.99 |
| Both days | 2.22 | -0.57 | 2.24 | -0.39 |
Other types of error include Mean Average Percent Error (MAPE) and Root Mean Square Error (RMSE). However, MAPE is unsuitable to measure deviation from zero and RMSE over-emphasizes high error values. The absolute magnitude of error (i.e. MAE) is most useful because it reflects, on average, the number of bikes per station each hour that our model mis-predicts.
Model generalizability is a measure of error that shows not the magnitude but the variation in error at different times, places, or across some variable. First, we can see the distribution of error across the 100 random folds from our 100-fold regression. For most folds, the error is clustered near the MAE of the final models. Both the supply and demand models are right-tailed, indicating that some folds have a particularly high error. The next few sections look into exactly where and when our models are underperforming.
gridExtra::grid.arrange(
ggplot(mod_dep10$resample) +
geom_histogram(aes(x = MAE)) +
plotTheme() +
coord_cartesian(xlim = c(1.5,2.5), ylim = c(1,25)) +
theme(axis.text.y = element_blank(),
axis.title.y = element_blank()) +
labs(title = "Distribution of error across 100 random folds",
subtitle = "Demand (departures)"),
ggplot(mod_arr10$resample) +
geom_histogram(aes(x = MAE)) +
plotTheme() +
coord_cartesian(xlim = c(1.5,2.5), ylim = c(1,25)) +
theme(axis.text.y = element_blank(),
axis.title.y = element_blank()) +
labs(title = "",
subtitle = "Supply (arrivals)"),
nrow = 1)
Next is a comparison of error over time. On weekdays, both models perform relatively well, typically sligntly under-predicting in the morning, then over-predicting during the PM peak. On weekends, however, the models systematically under predict the number of arrivals and departures at each station. This may indicate that some time-based variable needs to be added. Another possibility includes making separate models for weekdays and weekends, as it’s been illustrated that weekdays and weekends see very different activity patterns.
time_err <- output %>%
group_by(date_id) %>%
summarize(orig = mean(orig, na.rm = TRUE),
orig_pred = mean(orig_pred, na.rm = TRUE),
orig_err = mean(orig_pred - orig, na.rm = TRUE),
orig_mae = mean(abs(orig_pred - orig),na.rm = TRUE),
dest = mean(dest, na.rm = TRUE),
dest_pred = mean(dest_pred, na.rm = TRUE),
dest_err = mean(dest_pred - dest, na.rm = TRUE),
dest_mae = mean(abs(dest_pred - dest),na.rm = TRUE)) %>%
mutate(hour = hour(date_id))
gridExtra::grid.arrange(
ggplot(data = time_err %>%
rbind(time_err %>%
filter(hour == 0) %>%
mutate(hour = 24)) %>%
filter(wday(date_id) == 6)) +
geom_area(aes(x = hour, y = orig), fill = "gray", alpha = 0.25) +
geom_line(aes(x = hour, y = orig), color = "gray",size = 1) +
geom_area(aes(x = hour, y = orig_pred), fill = "limegreen", alpha = 0.25) +
geom_line(aes(x = hour, y = orig_pred),
color = "limegreen", size = 1) +
geom_line(aes(x = hour, y = (orig_err)),
color = "black", size = 1, linetype = "1111") +
plotTheme() +
scale_x_continuous(breaks = c(0,6,12,18,24)) +
coord_cartesian(xlim = c(1,23),ylim = c(-3,10)) +
labs(title = "Friday predictions by hour",
subtitle = "Demand (departures)",
x = "Hour", y = "Bikes"),
ggplot(data = time_err %>%
rbind(time_err %>%
filter(hour == 0) %>%
mutate(hour = 24)) %>%
filter(wday(date_id) == 6)) +
geom_area(aes(x = hour, y = dest), fill = "gray", alpha = 0.25) +
geom_line(aes(x = hour, y = dest), color = "gray",size = 1) +
geom_area(aes(x = hour, y = dest_pred), fill = "magenta", alpha = 0.25) +
geom_line(aes(x = hour, y = dest_pred),
color = "magenta", size = 1) +
geom_line(aes(x = hour, y = (dest_err)),
color = "black", size = 1, linetype = "1111") +
plotTheme() +
scale_x_continuous(breaks = c(0,6,12,18,24)) +
coord_cartesian(xlim = c(1,23),ylim = c(-3,10)) +
labs(title = "(error in black)",
subtitle = "Supply (arrivals)",
x = "Hour", y = ""),
nrow = 1)
gridExtra::grid.arrange(
ggplot(data = time_err %>%
rbind(time_err %>%
filter(hour == 0) %>%
mutate(hour = 24)) %>%
filter(wday(date_id) == 7)) +
geom_area(aes(x = hour, y = orig), fill = "gray", alpha = 0.25) +
geom_line(aes(x = hour, y = orig), color = "gray", size = 1) +
geom_area(aes(x = hour, y = orig_pred), fill = "limegreen", alpha = 0.25) +
geom_line(aes(x = hour, y = orig_pred),
color = "limegreen", size = 1) +
geom_line(aes(x = hour, y = (orig_err)),
color = "black", size = 1, linetype = "1111") +
plotTheme() +
scale_x_continuous(breaks = c(0,6,12,18,24)) +
coord_cartesian(xlim = c(1,23),ylim = c(-3,10)) +
labs(title = "Saturday predictions by hour",
subtitle = "Demand (departures)",
x = "Hour", y = "Bikes"),
ggplot(data = time_err %>%
rbind(time_err %>%
filter(hour == 0) %>%
mutate(hour = 24)) %>%
filter(wday(date_id) == 7)) +
geom_area(aes(x = hour, y = dest), fill = "gray", alpha = 0.25) +
geom_line(aes(x = hour, y = dest), color = "gray",size = 1) +
geom_area(aes(x = hour, y = dest_pred), fill = "magenta", alpha = 0.25) +
geom_line(aes(x = hour, y = dest_pred),
color = "magenta", size = 1) +
geom_line(aes(x = hour, y = (dest_err)),
color = "black", size = 1, linetype = "1111") +
plotTheme() +
scale_x_continuous(breaks = c(0,6,12,18,24)) +
coord_cartesian(xlim = c(1,23),ylim = c(-3,10)) +
labs(title = "(error in black)",
subtitle = "Supply (arrivals)",
x = "Hour", y = ""),
nrow = 1)
Lastly, the maps below indicate that stations in central Manhattan are likely to be predicted with significant error. These are also high-volume stations that see a lot of activity, so it might be expected that the error is higher as well.
space_err <- output %>%
group_by(station_id) %>%
summarize(orig_err = mean(abs(orig_pred - orig), na.rm = TRUE),
dest_err = mean(abs(dest_pred - dest), na.rm = TRUE)) %>%
left_join(all_stations %>% select(station_id,lon,lat))
gridExtra::grid.arrange(
ggplot() +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(data = space_err %>% arrange(orig_err),
aes(color = orig_err, size = orig_err,
x = lon, y = lat)) +
coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
mapTheme() +
scale_color_gradientn(colors = c("gray20","red","red","red","red","red","red"),
limits = c(0,40)) +
scale_size(range = c(0.5,3)) +
labs(title = "Mean absolute error by station",
subtitle = "Demand (departures)") +
theme(legend.position = "none"),
ggplot() +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(data = space_err %>% arrange(dest_err),
aes(color = dest_err, size = dest_err,
x = lon, y = lat)) +
coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
mapTheme() +
scale_color_gradientn(colors = c("gray20","red","red","red","red","red","red"),
limits = c(0,40)) +
scale_size(range = c(0.5,3)) +
labs(title = "",subtitle = "Supply (arrivals)") +
theme(legend.position = "none"),
nrow = 1)
Next steps to improve the model could include:
Building a model to predict the excess bike demand at each station (demand-supply), rather than deriving the predicted excess demand from two separate models predicting demand and supply. Directly predicting this variable may have the potential to provide more robust predictions.
Expanding the model to take into account the current availability of bikes and docks - when combined with predicted excess demand values, this will allow the model to meet the predictive needs of the app and convey information on how many bikes are projected to be available throughout the day.